Data analysis taking uncertainty into account
The outcome of a random experiment can be described by a random variable.
Whenever chance is involved in the outcome of an experiment, the outcome is a random variable. Examples include the number of dots shown by a die and the weight of a newborn baby.
A random variable is usually denoted by a capital letter, \(X, Y, Z, \dots\). Values collected in an experiment are observations of the random variable, usually denoted by lowercase letters \(x, y, z, \dots\).
A random variable cannot be predicted exactly, but the probability of each possible outcome can be described.
The population is the collection of all possible observations of the random variable. Note, the population is not always countable.
A sample is a subset of the population.
A discrete random variable can be described by its probability mass function.
| x | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| p(x) | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 |

Probability mass function of a fair die.
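The theoretical pmf above can be checked by simulation; a minimal sketch in R (the number of throws is an arbitrary choice):

```r
# Simulate many throws of a fair die and compare the empirical
# frequencies with the theoretical pmf p(x) = 1/6 (about 0.17).
set.seed(1)
throws <- sample(1:6, size = 10000, replace = TRUE)
p_hat <- table(throws) / length(throws)
round(p_hat, 2)
```

Each empirical frequency should land close to 0.17, with small deviations due to chance.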
The random variable has two possible outcomes: non-smoker (0) and smoker (1). The probability that a random mother smokes is 0.39.
|      | non-smoker | smoker |
|---|---|---|
| x    | 0 | 1 |
| p(x) | 0.61 | 0.39 |
The probability that the random variable, \(X\), takes the value \(x\) is denoted \(P(X=x) = p(x)\). Note that \(0 \leq p(x) \leq 1\) and that the probabilities of all possible outcomes sum to one, \(\sum_x p(x) = 1\).
When throwing 10 dice, how many dice show 6 dots?

- What are the possible outcomes?
- Estimate the probability mass function.
A Bernoulli trial is a random experiment with two outcomes; success and failure. The probability of success, \(P(success) = p\), is constant. The probability of failure is \(P(failure) = 1-p\).
In code it is convenient to represent success as 1 and failure as 0.
The outcome of a Bernoulli trial is a discrete random variable, \(X\).
| x | 0 | 1 |
|---|---|---|
| p(x) | 1-p | p |
The number of successes in a series of \(n\) independent and identical Bernoulli trials is also a discrete random variable:

\(Y = \sum_{i=1}^n X_i\)
The probability mass function of \(Y\) is called the binomial distribution.
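In R the binomial pmf is available through `dbinom`; a sketch applying it to the ten-dice question above, where a success is a die showing 6 dots, so \(p = 1/6\):

```r
# P(Y = k) for Y ~ Bin(n = 10, p = 1/6): number of sixes among 10 dice
n <- 10
p <- 1/6
k <- 0:n
py <- dbinom(k, size = n, prob = p)
round(py, 3)
# As for any pmf, the probabilities sum to 1
sum(py)
```

Note that `py[1]` is \(P(Y=0) = (5/6)^{10}\), the probability that none of the ten dice shows a six.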
A continuous random variable can be described by its probability density function.
The data set babies consists of data for 1236 male babies and their mothers. All babies are born in Oakland in the 1960s.
The weight of a random newborn baby is a continuous random variable; let's call it \(W\). In this example the entire population is known and can be summarized in a histogram.
The probability density function, \(f(x)\), is defined such that the total area under the curve is 1.
\[ \int_{-\infty}^{\infty} f(x) dx = 1 \]
\(P(a \leq X \leq b) = \int_a^b f(x) dx\)
The cumulative distribution function, sometimes called just the distribution function, \(F(x)\), is defined as:
\[F(x) = P(X \leq x) = \int_{-\infty}^x f(t) dt\]
For a continuous random variable \(P(X < x) = P(X \leq x)\), so:
\[P(X < x) = F(x)\]
\[P(X \geq x) = 1 - F(x)\]
\[P(a \leq X < b) = F(b) - F(a)\]
When the entire population is known, probabilities can be computed by counting the number of observations that fulfil the criterion and dividing by the total number of observations.
Weight distribution
library(UsingR)
## The weights are originally in ounces; transform to kg
ounce <- 0.0283495231
wt <- babies$wt*ounce
## P(W > 4.0)
## Count the number of babies with a weight > 4.0 kg
sum(wt>4)
## [1] 133
## How many babies in total
length(wt)
## [1] 1236
## Fraction of babies with weight > 4.0 kg, this is P(W>4.0)
sum(wt>4)/length(wt)
## [1] 0.11
## Another way to compute P(W>4.0)
mean(wt>4)
## [1] 0.11
Based on the babies population, compute the following probabilities
## # A tibble: 5 x 4
## smoke n p code
## <dbl> <int> <dbl> <chr>
## 1 0 544 0.440 never
## 2 1 484 0.392 smokes now
## 3 2 95 0.0769 until current pregnancy
## 4 3 103 0.0833 once did, not now
## 5 9 10 0.00809 unknown
Let \(S\) denote the smoking status of a random mother. Note that \(S\) is a discrete random variable. The probability that a random mother never smoked is \(P(S=0) = p(0) = 0.44\).
Compute the probability that a smoking mother has a baby with a weight below 2.6 kg.
\[P(W<2.6|S=1)\]
Compute the probability that a mother who never smoked has a baby with a weight below 2.6 kg.
\[P(W<2.6|S=0)\]
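With the full population available, such conditional probabilities reduce to counting within the relevant subgroup. A minimal sketch of the pattern, using a small made-up data frame in place of the real babies data (all numbers here are illustrative only):

```r
# Hypothetical stand-in for the babies data: weight in kg and smoking status
# (smoke = 1 smoker, smoke = 0 never smoked)
d <- data.frame(wt_kg = c(2.4, 3.1, 2.5, 3.6, 2.9, 3.3),
                smoke = c(1, 1, 1, 0, 0, 0))
# P(W < 2.6 | S = 1): fraction of smokers' babies below 2.6 kg
p_smoker <- mean(d$wt_kg[d$smoke == 1] < 2.6)
# P(W < 2.6 | S = 0): the same fraction among mothers who never smoked
p_never <- mean(d$wt_kg[d$smoke == 0] < 2.6)
c(p_smoker, p_never)
```

The conditioning corresponds to restricting both the count and the denominator to the subgroup, here via the subsetting `d$wt_kg[d$smoke == 1]`.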
|      | pos | neg | tot |
|---|---|---|---|
| not cancer | 98 | 882 | 980 |
| cancer | 16 | 4 | 20 |
| total | 114 | 886 | 1000 |
Categorical data are reported as frequencies or proportions and summarized using the mode.
It is often not necessary to distinguish between the interval and ratio scales; it can be more useful to divide the quantitative scales into discrete and continuous.
Useful summary statistics include mean, median, variance, standard deviation.
The expected value of a random variable, or the population mean, is
\[\mu = E[X] = \frac{1}{N}\displaystyle\sum_{i=1}^N x_i,\] where the sum is over all \(N\) data points in the population.
The above formula is probably the most intuitive for finite populations, but for infinite populations other definitions can be used.
For a discrete random variable:
\[\mu = E[X] = \displaystyle\sum_{k=1}^K x_k p(x_k),\]
where the sum is taken over all possible outcomes.
For a continuous random variable:
\[\mu = E[X] = \int_{-\infty}^\infty x f(x) dx\]
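Both definitions can be checked numerically in R; the fair die gives a discrete example and the standard normal a continuous one (the integral is evaluated numerically, so the result is approximate):

```r
# Discrete: expected value of a fair die, the sum of x * p(x) over outcomes
mu_die <- sum(1:6 * rep(1/6, 6))
mu_die  # 3.5

# Continuous: E[X] for X ~ N(0, 1), numerical integration of x * f(x)
mu_norm <- integrate(function(x) x * dnorm(x), -Inf, Inf)$value
mu_norm  # close to 0
```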
\[E[aX] = a E[X]\]
\[E[X + Y] = E[X] + E[Y]\]
\[E[aX + bY] = aE[X] + bE[Y]\]
The variance of a random variable, the population variance, is defined as
\[\sigma^2 = var(X) = E[(X-\mu)^2]\]
\[\sigma^2 = var(X) = \frac{1}{N} \sum_{i=1}^N (x_i-\mu)^2,\] where the sum is over all \(N\) data points in the population.
\[\sigma^2 = var(X) = E[(X-\mu)^2] = \left\{\begin{array}{ll} \displaystyle\sum_{k=1}^K (x_k-\mu)^2 p(x_k) & \textrm{if }X\textrm{ discrete} \\ \\ \displaystyle\int_{-\infty}^\infty (x-\mu)^2 f(x) dx & \textrm{if }X\textrm{ continuous} \end{array}\right.\]
Standard deviation
\[\sigma = \sqrt{var(X)}\]
\[var(aX) = a^2 var(X)\]
For independent random variables \(X\) and \(Y\):
\[var(aX + bY) = a^2var(X) + b^2var(Y)\]
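The rule for independent sums is easy to illustrate by simulation; a sketch with the arbitrary choices \(a = 2\), \(b = 3\) and two independent normal variables (agreement is only approximate because of sampling error):

```r
set.seed(1)
x <- rnorm(1e5, mean = 0, sd = 1)   # var(X) = 1
y <- rnorm(1e5, mean = 0, sd = 2)   # var(Y) = 4
a <- 2
b <- 3
# Theory: var(aX + bY) = a^2 * var(X) + b^2 * var(Y) = 4 * 1 + 9 * 4 = 40
var(a * x + b * y)
```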
Consider the below data and summarize each of the variables.
| id | smoker | baby weight (kg) | gender | mother weight (kg) | mother age | parity | married |
|---|---|---|---|---|---|---|---|
| 1 | yes | 2.8 | F | 64 | 21 | 2 | yes |
| 2 | yes | 3.2 | F | 65 | 27 | 1 | yes |
| 3 | yes | 3.5 | M | 64 | 31 | 2 | no |
| 4 | yes | 2.7 | M | 73 | 32 | 0 | yes |
| 5 | yes | 3.3 | F | 59 | 39 | 3 | no |
| 6 | no | 3.7 | M | 61 | 26 | 0 | yes |
| 7 | no | 3.3 | M | 52 | 27 | 2 | no |
| 8 | no | 4.3 | M | 59 | 21 | 0 | no |
| 9 | no | 3.2 | M | 65 | 28 | 1 | yes |
| 10 | no | 3.0 | F | 73 | 33 | 4 | no |
Draw conclusions regarding properties of a population based on observations of a random sample from the population.
The sample mean is denoted \(m = \bar x\). For a sample of size \(n\) the sample mean is:
\[m = \bar x = \frac{1}{n}\displaystyle\sum_{i=1}^n x_i\]
When we only have a sample of size \(n\), the sample mean \(m\) is our best estimate of the population mean. It is possible to show that the sample mean is an unbiased estimate of the population mean, i.e. the average (over many samples of size \(n\)) of the sample mean is \(\mu\).
\[E[\bar X] = \frac{1}{n} n E[X] = E[X] = \mu\]
The sample variance is computed as
\[s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-m)^2\] The sample variance is an unbiased estimate of the population variance.
\[E[s^2] = \sigma^2\]
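Unbiasedness of \(m\) and \(s^2\) can be illustrated by repeated sampling; a sketch in which the population, sample size, and number of repetitions are all arbitrary choices:

```r
set.seed(1)
# A finite population with known mean and variance
pop <- rnorm(10000, mean = 5, sd = 2)
mu <- mean(pop)
sigma2 <- mean((pop - mu)^2)  # population variance, denominator N
# Draw many samples of size n and average the estimates
n <- 10
res <- replicate(5000, {
  x <- sample(pop, n)
  c(m = mean(x), s2 = var(x))  # var() uses denominator n - 1
})
rowMeans(res)  # close to c(mu, sigma2)
```

Averaging over many samples, both estimates land close to the population values; using denominator \(n\) instead of \(n-1\) in `var()` would systematically underestimate \(\sigma^2\).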
The normal distribution (sometimes referred to as the Gaussian distribution) is a common probability distribution and many continuous random variables can be described by the normal distribution or be approximated by the normal distribution.
The normal probability density function
\[f(x) = \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2}\]
describes the distribution of a normal random variable, \(X\), with expected value \(\mu\) and standard deviation \(\sigma\). In short we write \(X \sim N(\mu, \sigma)\).
The bell-shaped normal distribution is symmetric around \(\mu\), and \(f(x) \rightarrow 0\) as \(x \rightarrow \infty\) and as \(x \rightarrow -\infty\).
As \(f(x)\) is well defined, values of the cumulative distribution function \(F(x) = \int_{- \infty}^x f(t) dt\) can be computed.
If \(X\) is normally distributed with expected value \(\mu\) and standard deviation \(\sigma\) we write:
\[X \sim N(\mu, \sigma)\]
Using transformation rules we can define
\[Z = \frac{X-\mu}{\sigma}, \, Z \sim N(0,1)\]
Values of \(F(z)\), the standard normal distribution, are tabulated (and easy to compute in R using the function `pnorm`).
Some values of particular interest: \(F(1.64) = 0.95\) and \(F(1.96) = 0.975\).
As the normal distribution is symmetric, \(F(-1.64) = 0.05\) and \(F(-1.96) = 0.025\).
Thus \(P(-1.96 < Z < 1.96) = 0.95\).
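These values are easy to verify with `pnorm` and its inverse `qnorm`:

```r
# Standard normal cdf values
pnorm(1.96)                  # approximately 0.975
pnorm(-1.96)                 # approximately 0.025, by symmetry
pnorm(1.96) - pnorm(-1.96)   # P(-1.96 < Z < 1.96), approximately 0.95
qnorm(0.975)                 # the inverse: approximately 1.96
```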
If \(X \sim N(\mu_1, \sigma_1)\) and \(Y \sim N(\mu_2, \sigma_2)\) are two independent normal random variables, then their sum is also a normal random variable:
\[X + Y \sim N\left(\mu_1 + \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2}\right)\]
and
\[X - Y \sim N\left(\mu_1 - \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2}\right)\]
The sum of \(n\) independent and identically distributed random variables is approximately normally distributed, if \(n\) is large enough.
As a result of the central limit theorem, the distribution of fractions or mean values of a sample is approximately normal, at least if the sample is large enough (a rule of thumb is a sample size \(n>30\)).
The data set fat consists of measurements for 252 men; let's take a closer look at BMI.
## Population mean
mu <- mean(fat$BMI)
mu
## [1] 25
## Population variance
sigma2 <- var(fat$BMI)/nrow(fat)*(nrow(fat)-1)
sigma2
## [1] 13
## Population standard deviation
sigma <- sqrt(sigma2)
sigma
## [1] 3.6
Randomly sample 3, 5, 10, 15, 20, and 30 men and compute the mean value, \(m\). Repeat many times to get the distribution of mean values.
Note, the mean is just the sum divided by the sample size \(n\).
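The repeated-sampling step can be sketched in R. Since the exercise requires the fat data set, the sketch below uses a stand-in normal population with the reported BMI summary (mean 25, standard deviation 3.6):

```r
set.seed(1)
# Stand-in population mimicking the reported BMI summary of the 252 men
pop <- rnorm(252, mean = 25, sd = 3.6)
sizes <- c(3, 5, 10, 15, 20, 30)
# For each sample size n, repeat the sampling many times and keep the means
means <- lapply(sizes, function(n) replicate(1000, mean(sample(pop, n))))
names(means) <- sizes
# The spread of the sample mean shrinks as n grows (roughly sigma / sqrt(n))
sapply(means, sd)
```

Plotting a histogram of each element of `means` shows the sampling distribution of \(m\) narrowing around \(\mu\) as \(n\) increases.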